Written by: Duy Huynh
Clustering is the task of grouping objects into clusters such that objects within the same cluster are as similar as possible. We essentially split complicated data into simpler chunks for analysis. Some use case examples include customer segmentation (which this note demonstrates), anomaly detection, and grouping similar documents or images.
Clustering is especially useful for Exploratory Data Analysis, as it gives insight into the basic characteristics of the data at hand and raises questions about its underlying connectivity properties.
K-means is a centroid-based clustering technique. In layman's terms, we pick a number of clusters k, represent each cluster by its center point (the centroid), and assign every observation to the cluster whose centroid is closest.
Compared to other clustering algorithms, k-means has the advantage of interpretability: the cluster centroids live in the same space as the original features, so they are directly comparable to them and the result is easy to analyze. Optimizing our algorithm is important, but from a business point of view, being able to interpret and communicate our findings to others is just as valuable.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()
import warnings
warnings.filterwarnings('ignore')
from IPython.display import Image
df = pd.read_csv("data/E-commerce.csv")
df.head(10)
|  | ID | n_clicks | n_visits | amount_spent | amount_discount | days_since_registration | profile_information |
|---|---|---|---|---|---|---|---|
| 0 | 2085 | 643 | 142 | 300.000000 | 228.432503 | 132 | 411 |
| 1 | 742 | 527 | 95 | 743.832508 | 60.882831 | 86 | 184 |
| 2 | 750 | 367 | 49 | 305.668886 | 72.961801 | 334 | 329 |
| 3 | 1224 | 466 | 30 | 291.191193 | 101.903170 | 131 | 148 |
| 4 | 2210 | 715 | 169 | 703.136878 | 506.416735 | 114 | 160 |
| 5 | 1632 | 204 | 84 | 312.331152 | 59.930122 | 274 | 295 |
| 6 | 2416 | 600 | 135 | 1008.169771 | 717.696496 | 165 | 28 |
| 7 | 903 | 316 | 115 | 823.874562 | 6.360026 | 184 | 146 |
| 8 | 111 | 224 | 115 | 800.000000 | 24.414520 | 324 | 0 |
| 9 | 1581 | 363 | 132 | 322.490806 | 67.122764 | 248 | 101 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   ID                       2500 non-null   int64
 1   n_clicks                 2500 non-null   int64
 2   n_visits                 2500 non-null   int64
 3   amount_spent             2500 non-null   float64
 4   amount_discount          2500 non-null   float64
 5   days_since_registration  2500 non-null   int64
 6   profile_information      2500 non-null   int64
dtypes: float64(2), int64(5)
memory usage: 136.8 KB
This is an e-commerce dataset showing advertisement and page interactions from 2,500 customers.
Features:
- ID: customer identifier
- n_clicks: number of clicks
- n_visits: number of visits
- amount_spent: amount the customer has spent
- amount_discount: amount of discount the customer has received
- days_since_registration: days since the customer registered
- profile_information: profile information score
We will first visualize the data and try to gain more insight into what we are working with.
sns.pairplot(df.drop(columns=['ID'], axis=1), diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x213e572d240>
At first glance, we can see that most attributes follow a normal distribution, except for "amount_spent" and "amount_discount". We will separate these two into a jointplot for a closer look.
sns.jointplot(y=df['amount_spent'], x=df['amount_discount'])
<seaborn.axisgrid.JointGrid at 0x213eba7ea40>
Looking at the visualization, we can conclude that:
To keep things simple, we first implement clustering based on only 2 axes, "amount_spent" and "n_visits", since the clusters formed by these 2 features are fairly obvious.
c_df = df.loc[:,["amount_spent","n_visits"]]
sns.scatterplot(x=df['amount_spent'],y=df['n_visits'])
<AxesSubplot:xlabel='amount_spent', ylabel='n_visits'>
Using just our human eyes, we can clearly see 2 separate groups in the scatter plot: one with a high spending amount but few visits, and one with many visits but a low spending amount.
The k-means clustering method works by minimizing the variance within each cluster, which requires calculating the distance between each observation and the cluster centroids. Specifically, in this note we will use Euclidean distance: $d\left( x,y\right) = \sqrt {\sum _{i=1}^{n} \left( y_{i}-x_{i}\right)^2 }$
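To make the mechanics concrete, here is a minimal sketch of a single k-means iteration on a tiny made-up array (the points and starting centroids below are invented purely for illustration): assign each point to its nearest centroid using the distance above, then move each centroid to the mean of its assigned points.

import numpy as np
# Toy data: 6 made-up points in 2D and 2 made-up starting centroids
points = np.array([[0.1, 0.2], [0.2, 0.1], [0.0, 0.3],
                   [0.9, 0.8], [1.0, 1.0], [0.8, 0.9]])
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
# Assignment step: Euclidean distance from every point to every centroid
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)
# Update step: each centroid moves to the mean of the points assigned to it
centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
print(labels, centroids, sep="\n")

scikit-learn's KMeans repeats these two steps until the centroids stop moving (or max_iter is reached).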
With this formula, every column should be normalized so that differences in scale between columns do not skew the result (amount_spent's values are generally much larger than n_visits'). Normalization will be performed using a min-max scaler.
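For reference, the min-max scaler maps each column to the $[0, 1]$ range: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$, so a column measured in hundreds of dollars no longer dominates one measured in visit counts.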
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
scaler = MinMaxScaler()
scaler.fit(c_df)
X = pd.DataFrame(scaler.transform(c_df))
X.columns = c_df.columns
X
|  | amount_spent | n_visits |
|---|---|---|
| 0 | 0.045680 | 0.530120 |
| 1 | 0.113261 | 0.341365 |
| 2 | 0.046543 | 0.156627 |
| 3 | 0.044339 | 0.080321 |
| 4 | 0.107065 | 0.638554 |
| ... | ... | ... |
| 2495 | 0.506841 | 0.622490 |
| 2496 | 0.302396 | 0.064257 |
| 2497 | 0.142662 | 0.301205 |
| 2498 | 0.061654 | 0.204819 |
| 2499 | 0.218302 | 0.602410 |
2500 rows × 2 columns
As mentioned above, k is a free parameter and has to be chosen, which raises the question: what is the optimal value of k?
We want the number of clusters to be large enough to minimize the variance within each group, but not so large that it creates artificial boundaries within what is really one group.
There exist many methods to figure out k, such as the elbow method, the silhouette score, and the gap statistic. In this note we will use the elbow method to determine k. The method is easiest to explain with a visualization.
def DrawElbow(data):
    # Fit k-means for k = 1..10 and record the within-cluster sum of squares (inertia)
    intra_variance = []
    for i in range(1, 11):
        km = KMeans(n_clusters=i,
                    max_iter=300)
        km.fit(data)
        intra_variance.append(km.inertia_)
    # Index by k so the x-axis of the elbow plot shows the actual number of clusters
    intra_variance = pd.Series(intra_variance, index=range(1, 11))
    return intra_variance
wcss = DrawElbow(X)
font = {'family': 'serif',
'color': 'darkred',
'weight': 'normal',
'size': 16,
}
plt.ylabel('Sum of Squared Distances')
plt.xlabel('Clusters')
plt.text(3, wcss[3]+5, "Elbow", ha='left', va="bottom", fontdict=font)
wcss.plot(linestyle='--', marker='o', color='b', label='line with marker')
<AxesSubplot:xlabel='Clusters', ylabel='Sum of Squared Distances'>
As you might have guessed, we iterate through a range of k values and determine at which k we stop seeing a large drop in intra-cluster variance, a.k.a. the "elbow".
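As a rough numerical complement to eyeballing the chart, we can also print how much the inertia falls at each step; the elbow is roughly where these drops level off (this is just a quick sanity check on top of the chart, not a separate method):

drops = wcss.diff().abs()   # drop in inertia each time k increases by 1
print(drops)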
As shown in the chart, the sum of squared distances drops steeply up to k = 3 and flattens out afterwards. So k = 3 is our "elbow", and also the optimal number of clusters. We now run k-means on the data with k = 3.
km = KMeans(n_clusters=3,
max_iter=300)
km.fit(X)
c_df['label'] = km.labels_
plt.figure(figsize=(20,10))
sns.scatterplot(x=c_df['amount_spent'], y=c_df["n_visits"], hue=c_df['label'],palette="flare")
<AxesSubplot:xlabel='amount_spent', ylabel='n_visits'>
Unlike the two separate groups we eyeballed from the scatter plot, the data is actually classified into 3 clusters.
With this in mind, we can perform analysis separately on each group and take action accordingly. Here we only clustered on 2 features, so we will hold off on the analysis for now and first expand the process to all features in our dataset.
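Before moving on, it is worth illustrating the interpretability point from earlier: the cluster centers can be mapped back into the original units, so each group can be described directly in terms of spending and visits. A minimal sketch using the scaler and model fitted above:

# Centers were learned in the scaled space; undo the scaling to read them
# in the original units of amount_spent and n_visits
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_), columns=X.columns)
print(centers)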
Scaling all features in the data
df = df.drop(columns=['ID'], axis=1)
scaler.fit(df)
df_scaled = pd.DataFrame(scaler.transform(df))
df_scaled.columns = df.columns
df_scaled
|  | n_clicks | n_visits | amount_spent | amount_discount | days_since_registration | profile_information |
|---|---|---|---|---|---|---|
| 0 | 0.495819 | 0.530120 | 0.045680 | 0.094067 | 0.256809 | 0.702564 |
| 1 | 0.398829 | 0.341365 | 0.113261 | 0.025071 | 0.167315 | 0.314530 |
| 2 | 0.265050 | 0.156627 | 0.046543 | 0.030045 | 0.649805 | 0.562393 |
| 3 | 0.347826 | 0.080321 | 0.044339 | 0.041963 | 0.254864 | 0.252991 |
| 4 | 0.556020 | 0.638554 | 0.107065 | 0.208539 | 0.221790 | 0.273504 |
| ... | ... | ... | ... | ... | ... | ... |
| 2495 | 0.151338 | 0.622490 | 0.506841 | 0.061074 | 0.305447 | 0.536752 |
| 2496 | 0.159699 | 0.064257 | 0.302396 | 0.002112 | 0.428016 | 0.352137 |
| 2497 | 0.427258 | 0.301205 | 0.142662 | 0.060384 | 0.186770 | 0.338462 |
| 2498 | 0.248328 | 0.204819 | 0.061654 | 0.049120 | 0.377432 | 0.200000 |
| 2499 | 0.646321 | 0.602410 | 0.218302 | 0.468710 | 0.527237 | 0.396581 |
2500 rows × 6 columns
Using the elbow method to determine k
wcss = DrawElbow(df_scaled)
plt.ylabel('Sum of Squared Distances')
plt.xlabel('Clusters')
wcss.plot(linestyle='--', marker='o', color='b', label='line with marker')
<AxesSubplot:xlabel='Clusters', ylabel='Sum of Squared Distances'>
At k = 3, the intra-cluster variance starts decreasing in a roughly linear fashion, so we choose 3 as our optimal k.
km = KMeans(n_clusters=3,
max_iter=300)
km.fit(df_scaled)
df['label'] = km.labels_+1
df_scaled['label'] = km.labels_+1
With the 3 groups fully distinguished, we can now perform both horizontal comparisons between groups and vertical analysis on one group at a time. This reduces the complexity of our data and emphasizes meaningful underlying structure. An obvious business use case would be: "If we run an advertisement program, which group of users should we target?"
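Before comparing the groups, it is worth checking how many customers fall into each one, since a very small segment is rarely worth a dedicated campaign. A quick check using the labels we just assigned:

# Number of customers per cluster
print(df['label'].value_counts().sort_index())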
First we have to visualize the labels. Because we have 6 features, it is much harder to fully visualize the data on a 2D or 3D plane. The best we can do is a parallel coordinates plot, a common visualization for high-dimensional data.
df.groupby('label').mean()
| label | n_clicks | n_visits | amount_spent | amount_discount | days_since_registration | profile_information |
|---|---|---|---|---|---|---|
| 1 | 386.411858 | 89.947036 | 646.966945 | 145.221189 | 201.683794 | 196.807115 |
| 2 | 586.822727 | 126.122727 | 1404.592274 | 1123.615193 | 202.342424 | 198.090909 |
| 3 | 253.193043 | 68.113043 | 3247.448308 | 79.966540 | 197.840000 | 213.739130 |
Here we see that "days_since_registration" and "profile_information" are both very similar throughout the groups. So we should drop them for the sake of simplicity.
df_dropped = df.drop(columns=['days_since_registration', "profile_information"],axis=1)
fig = px.parallel_coordinates(df_dropped, color="label", dimensions=['n_clicks','amount_spent', 'amount_discount' ,'n_visits'],
color_continuous_scale=px.colors.sequential.Aggrnyl, width=1100, height=800)
fig.write_image("fig1.png")
Image(filename='fig1.png')
At first look the parallel coordinates plot might seem a bit convoluted, but it still gives us useful information and is a good place to start the analysis. Looking at the plot (and the group means above), we can see:
- Group 3 spends by far the most on average while having the fewest clicks and visits, and receives the smallest discounts.
- Group 2 has the most clicks and visits and receives by far the largest discounts, with mid-range spending.
- Group 1 spends the least, with moderate clicks, visits, and discounts.
fig, axes = plt.subplots(1, 2, figsize=(20,10))
sns.distplot(df[df['label']==1].amount_spent, ax=axes[0]).set(title='Spending Group 1')
sns.distplot(df[df['label']==1].n_clicks, ax=axes[1]).set(title='interaction Group 1')
[Text(0.5, 1.0, 'interaction Group 1')]
As an example of univariate analysis, we draw distribution plots of spending and interaction for group 1. We can see clearly that both variables roughly follow a normal distribution and have no noticeable outliers (except perhaps a small right tail). The average spending for this group is about 650, less than half of group 2's and only about a fifth of group 3's. A good number of users in this group have 0 spending, so we can conclude that many of them are just window-browsing.
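The window-browsing claim is easy to verify directly by checking what share of group 1 has zero spending. A quick check, using the same filter as above:

group1 = df[df['label'] == 1]
# Fraction of group-1 customers with zero spending
print((group1['amount_spent'] == 0).mean())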
From these conclusions, a marketer with more domain knowledge could come up with a suitable strategy aimed at this type of customer.
In this note we discussed the definition of clustering and expanded on k-means as a clustering method. We implemented k-means clustering by first scaling our features, choosing the optimal k through the "elbow method", and then interpreting our results with additional analysis.
There are some important elements outside the scope of this note that were not explicitly covered, including: